Loan Data Exploration by Wan Li

This data set contains 113,937 loans with 81 variables on each loan. I select the following 13 variables to explore: ListingKey, LoanStatus, BorrowerRate, ProsperScore, EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper, StatedMonthlyIncome, LoanMonthsSinceOrigination, LoanOriginalAmount, MonthlyLoanPayment, LP_CustomerPayments, Term.

Univariate Plots Section

##                ListingKey LoanStatus BorrowerRate ProsperScore
## 1 1021339766868145413AB3B  Completed       0.1580           NA
## 2 10273602499503308B223C1    Current       0.0920            7
## 3 0EE9337825851032864889A  Completed       0.2750           NA
## 4 0EF5356002482715299901A    Current       0.0974            9
## 5 0F023589499656230C5E3E2    Current       0.2085            4
## 6 0F05359734824199381F61D    Current       0.1314           10
##   EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## 1    Self-employed                   640                   659
## 2         Employed                   680                   699
## 3    Not available                   480                   499
## 4         Employed                   800                   819
## 5         Employed                   680                   699
## 6         Employed                   740                   759
##   StatedMonthlyIncome LoanMonthsSinceOrigination LoanOriginalAmount
## 1            3083.333                         78               9425
## 2            6125.000                          0              10000
## 3            2083.333                         86               3001
## 4            2875.000                         16              10000
## 5            9583.333                          6              15000
## 6            8333.333                          3              15000
##   MonthlyLoanPayment LP_CustomerPayments Term
## 1             330.43            11396.14   36
## 2             318.93                0.00   36
## 3             123.32             4186.63   36
## 4             321.45             5143.20   36
## 5             563.97             2819.85   36
## 6             342.37              679.34   60
## [1] 113937     13
## 'data.frame':    113937 obs. of  13 variables:
##  $ ListingKey                : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ LoanStatus                : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerRate              : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperScore              : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EmploymentStatus          : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ CreditScoreRangeLower     : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper     : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ StatedMonthlyIncome       : num  3083 6125 2083 2875 9583 ...
##  $ LoanMonthsSinceOrigination: int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanOriginalAmount        : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ MonthlyLoanPayment        : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments       : num  11396 0 4187 5143 2820 ...
##  $ Term                      : int  36 36 36 36 36 60 36 36 36 36 ...
##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"
## [1] ""              "Employed"      "Full-time"     "Not available"
## [5] "Not employed"  "Other"         "Part-time"     "Retired"      
## [9] "Self-employed"

I subset the dataset to include only 12 variables with 113937 observation.

## [1] FALSE
## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

The levels in LoanStatus is not ordered, and there are too many levels for past due.

## 
##              Completed FinalPaymentInProgress                Current 
##                  38074                    205                  56576 
##   Past Due (1-15 days)  Past Due (16-30 days)  Past Due (31-60 days) 
##                    806                    265                    363 
##  Past Due (61-90 days) Past Due (91-120 days)   Past Due (>120 days) 
##                    313                    304                     16 
##              Defaulted             Chargedoff              Cancelled 
##                   5018                  11992                      5

I ordered the LoanStatus levels from completed, nearly completed, current, to past due, defaulted, chargedoff and cancelled.

## 
##  Completed    Current    PastDue  Defaulted Chargedoff  Cancelled 
##      38074      56781       2067       5018      11992          5

I combined the pastdue levels together, and FinalPaymentInProgress to Current.The majority of the observations are current or completed, and the minority are PastDue, Defaulted and Chargedoff. The cancelled loans are very rare.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

The highest BorrowRate is 0.4975 and the lowest is 0. The majority of observations have the BorrowRate between 0.05 and 0.35. The interval near 0.32 has an unusual high frequency. I wonder what factors influence the BorrowRate.

## Warning: Removed 29084 rows containing non-finite values (stat_count).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    4.00    6.00    5.95    8.00   11.00   29084

ProsperScore is discrete quantitative variable, ranging from 1 to 11. The distribution is quite symmetric with most people having medium ProsperScore between 4 and 8.

## [1] FALSE
## 
##                    Employed     Full-time Not available  Not employed 
##          2255         67322         26355          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3806          1088           795          6134

The levels of EmploymentStatus is not ordered and some levels overlap, like Employed and Full-time.

##      Employed Self-employed       Retired  Not employed        Unkown 
##         94765          6134           795           835         11408

I ordered the levels and combined the Full-time, Part-time and Employed to one level: Employed. Most of the observations have Employed status, with 6134 people being self-employed. The employment statuses of 11408 people are not clear. A small number of oberservations have Retired or Not employed status.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   660.0   680.0   685.6   720.0   880.0     591
## 
##     19 
## 113346

The CreditScoreRangeLower equals CreditScoreRangeLower plus 19. I create a new variable CreditScore, which is average of CreditScoreRangeLower and CreditScoreRangeUpper. The CreditScore have discrete values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591
##  [1] 649.5 689.5 489.5 809.5 749.5 709.5 829.5 769.5 669.5 629.5 729.5
## [12] 529.5 789.5 609.5 589.5 549.5 569.5 509.5 849.5 869.5    NA 469.5
## [23]   9.5 889.5 449.5 429.5 369.5

Most people have CreditScore between 600 and 800.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3200    4667    5608    6825 1750003

There are some outliers in StatedMonthlyIncome. So I trimmed the highest 1% of the data.Distribution of StatedMonthlyIncome is long-tailed, skewed to the right. So I log-transformed the data. The distribution of transformed data is close to normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     6.0    21.0    31.9    65.0   100.0

Much more people started loan in the last 10 months than earlier months, which shows economic resurgence. There is alomost no people starting loans 55-64 months ago, which may be because of the economic crisis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
## [1] 14333

The amount of the original loan ranges from $1000 to $35000. 3 quarters of the loans are below 12000. Most of the loans are close to certain numbers like $4000, $10000, $1500, $5000. For example, there are 14333 loans with the amount of $4000.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Most loans have less than $400 MonthlyLoanPayment. The interval near 170 has the highest frequency.The log-transformed distribution apears to be bimodal peaking around 160 and again around 330.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    -2.35  1005.76  2583.83  4183.08  5548.40 40702.39

Distribution of LP_CustomerPayments is long-tailed.Log-transformed LP_CustomerPayments distribution appears close to normal distribution with peak at about 4000.

## 
##    12    36    60 
##  1614 87778 24545

There are 3 unique values for Term. The majority of observations have term of 36 months and term of 12 months is quite rare. I change the datatype of Term to factor.

Univariate Analysis

Structure of The Dataset

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. I select the following 12 variables to explore: ListingKey, LoanStatus, BorrowerRate, ProsperScore, EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper, StatedMonthlyIncome, LoanMonthsSinceOrigination, LoanOriginalAmount, MonthlyLoanPayment, LP_CustomerPayments, Term.

ListingKey is a factor variable showing the unique identifier of each observation. The other variables are features of the observation. The variables LoanStatus and EmploymentStatus are unordered factors. The other variables are numerical or integer variables. I created new variable CreditScore.

  • LoanStatus: Most observations are Current or Completed.
  • EmploymentStatus: Most obeservations has Employed EmploymentStatus.
  • BorrowerRate: The majority of the Borrow Rate fall between 0.05 and 0.35.
  • ProsperScore: Most people having medium ProsperScore between 4 and 8.
  • CreditScore: Most people have CreditScore between 600 and 800.
  • StatedMonthlyIncome: The median StatedMonthlyIncome is 4667.
  • LoanMonthsSinceOrigination: Much more people started loan in the last 10 months than earlier months. There is a break between 55 and 64.
  • LoanOriginalAmount: Ranging from 1000 to 35000. The median is 6500.
  • MonthlyLoanPayment: The median MonthlyLoanPayment is 217.7.
  • LP_CustomerPayments: The median LP_CustomerPayments is 2583.83.
  • Term: Most loans have term of 36 months.

Main Features of Interest

The main features of interest in the dataset are BorrowerRate, ProsperScore and CreditScore. I’d like to explore which feautures are mostly related to BorrowerRate. I think ProsperScore, CreditScore and other variables may be used to decide BorrowRate.

Other Features in the Dataset

I guess BorrowRate is related with ProsperScore and CreditScore. Other features like LoanStatus, EmploymentStatus, CreditScore, StatedMonthlyIncome, LoanOriginalAmount, and MonthlyLoanPayment may also influence BorrowRate.

New Variable from Existing Variables

The CreditScoreRangeLower equals CreditScoreRangeLower plus 19. I create a new variable CreditScore to represent the credit condition, which is average of CreditScoreRangeLower and CreditScoreRangeUpper. The CreditScore have discrete values.

Unusual Distributions, Data Wrangling and Transformation

Several variables like StatedMonthlyIncome, LoanOriginalAmount, MonthlyLoanPayment, ’LP_CustomerPayments, have skewed distribution. I trimed highest values and log-transformed these data.

The distribution of LoanMonthsSinceOrigination shows that there is alomost no people starting loans 55-64 months ago, which may be because of the economic crisis.

The factor variables LoanStatus and EmploymentStatus has messy, unordered levels. I combined some levels and ordered the levels. Now the levels of EmploymentStatus are Employed, Self-employed, Retired, Not employed, Unkown. The levels of LoanStatus are Completed, Current, PastDue, Defaulted, Chargedoff, Cancelled.

Bivariate Plots Section

From a subset of the data, BorrowerRate is most strongly related with ProsperScore, followed by CreditScore and LoanOriginalAmount. StatedMonthlyIncome, LoanMonthsSinceOrigination, MonthlyLoanPayment and LP_CustomerPayments don’t have strong correlation with BorrowerRate. StatedMonthlyIncome has moderate relationship with LoanOriginalAmount. The boxplots show that LoanStatus and EmploymentStatus may influence the BorrowerRate. I would like to look closer at the scatter plots and box plots involving BorrowerRate and Other variables.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$BorrowerRate and loan_sub$ProsperScore
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6536072 -0.6458311
## sample estimates:
##        cor 
## -0.6497361

The mean of BorrowerRate (red line) decreases aas ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship. The correlation coefficient is -0.65. BorrowerRate has strong relationship with ProsperScore.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$BorrowerRate and loan_sub$CreditScore
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4661358 -0.4569730
## sample estimates:
##        cor 
## -0.4615667

I trimmed the bottom and top 1% of the CreditScore data. The mean of BorrowerRate (red line) decreases as CreditScore increases. The variance of BorrowerRate is big. The correlation between these two variables is -0.46, which shows a moderate linear relationship.

The original plot of conditional mean (binwidth=100) has too much noise. I smoothed it with binwidth of 500 and add a smooth layer (blue line).

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$BorrowerRate and loan_sub$LoanOriginalAmount
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3341283 -0.3237719
## sample estimates:
##        cor 
## -0.3289599

The variance of BorrowerRate is big.As a whole, the conditional mean of BorrowRate decreases as LoanOriginalAmount increases, but there are many fluctuations. The correlation coefficient between these two variables is -0.33, showing a moderate relationship.

## loan_sub$LoanStatus: Completed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1173  0.1744  0.1864  0.2511  0.4975 
## -------------------------------------------------------- 
## loan_sub$LoanStatus: Current
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0577  0.1314  0.1760  0.1838  0.2310  0.3304 
## -------------------------------------------------------- 
## loan_sub$LoanStatus: PastDue
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0599  0.1885  0.2399  0.2344  0.2916  0.3435 
## -------------------------------------------------------- 
## loan_sub$LoanStatus: Defaulted
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1650  0.2296  0.2231  0.2875  0.4975 
## -------------------------------------------------------- 
## loan_sub$LoanStatus: Chargedoff
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1769  0.2400  0.2354  0.2975  0.4500 
## -------------------------------------------------------- 
## loan_sub$LoanStatus: Cancelled
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1075  0.1395  0.2000  0.1844  0.2375  0.2375

The means of BorrowerRate for PastDue, Defaulted and Chargedoff loans are higher than Completed and Current loans.

## loan_sub$EmploymentStatus: Employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1314  0.1800  0.1911  0.2498  0.3600 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Self-employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1400  0.1899  0.2023  0.2695  0.3500 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Retired
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.1202  0.1829  0.1944  0.2625  0.3500 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1830  0.2599  0.2441  0.3134  0.3500 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Unkown
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1449  0.1989  0.1977  0.2537  0.4975

Not employed people have much higher median BorrowerRate than other groups.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$LoanOriginalAmount and loan_sub$StatedMonthlyIncome
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1956816 0.2068243
## sample estimates:
##       cor 
## 0.2012595

The correlation between StatedMonthlyIncome and LoanOriginalAmount is not strong. But the condition mean shows that the mean of LoanOriginalAmount increases as StatedMonthlyIncome increases.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$MonthlyLoanPayment and sqrt(loan_sub$StatedMonthlyIncome)
## t = 135.8, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3682467 0.3782419
## sample estimates:
##       cor 
## 0.3732551
## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$MonthlyLoanPayment and loan_sub$StatedMonthlyIncome
## t = 67.764, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1912423 0.2024055
## sample estimates:
##       cor 
## 0.1968303
## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$MonthlyLoanPayment and log10(loan_sub$StatedMonthlyIncome + 1)
## t = 78.974, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2223015 0.2333118
## sample estimates:
##       cor 
## 0.2278139

The conditional mean of MonthlyLoanPayment increases as StateMonthlyIncome increases. Sqrt-transformed StateMonthlyIncome seems to have better linear relationship with the condition mean of MonthlyLoanPayment. The Correlation between square root of StateMonthlyIncome and MonthlyLoanPayment is 0.37, showing a moderate linear relationship.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$ProsperScore and loan_sub$CreditScore
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3637793 0.3753979
## sample estimates:
##      cor 
## 0.369603

ProsperScore and CreditScore have moderate positive relationship with correlation coefficient of 0.37.

## 
##  Pearson's product-moment correlation
## 
## data:  loan_sub$BorrowerRate and loan_sub$StatedMonthlyIncome
## t = -30.155, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09473938 -0.08321827
## sample estimates:
##        cor 
## -0.0889818

Conditional mean of BorrowerRate decreases as the StatedMonthlyIncome increases. But the variance is big and correlation is weak.

The term of loan only has a few discrete values, the MonthlyLoanPayment is mainly determined by LoanOriginalAmount, loan term and borrower rate.

## loan_sub$EmploymentStatus: Employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    3333    4981    5782    7083  483333 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Self-employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2734    4333    6338    7083 1750003 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Retired
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1575    2617    2987    3750   83333 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Not employed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0   196.5     0.0 20833.3 
## -------------------------------------------------------- 
## loan_sub$EmploymentStatus: Unkown
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2333    3500    4346    5250  208333

Employed people have higher income than Self-employed people who have higher income than retired people. Not employed people mostly don’t have any monthly income.

Current loans have the highest StatedMonthlyIncome. PastDue loans have lower StatedMonthlyIncome. Defaulted and Chargedoff loans has lower median StatedMonthlyIncome than PastDue loans. StatedMonthlyIncome of completed loans is in the middle.

Completed loans have the highest median ProsperScore. PastDue and Chargedoff loans have the lowest median Prosperscore.

Retired people have highest ProsperScore and self-employed people have lowest Prosperscore.

## NULL

Although BorrowerRate doesn’t seem to have any linear relationship with LoanMonthsSinceOrigination. BorrowerRate does seem to change with time. We can see that in different periods, the BorrowerRate has different ranges and patterns.

Loans with term of 12 tends to have lower BorrowerRate. Loans with term of 36 have larger range and variance.

Bivariate Analysis

Relationships Between Features of Interest and Other features

The mean of BorrowerRate decreases as ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship. The correlation coefficient is -0.65. BorrowerRate has strong negative relationship with ProsperScore.

The mean of BorrowerRate decreases as CreditScore increases. The variance of BorrowerRate is big. The correlation between these two variables is -0.46, which shows a moderate linear relationship.

The conditional mean of BorrowerRate decreases as LoanOriginalAmount increases, but there are many fluctuations. The correlation coefficient between these two variables is -0.33, showing a moderate relationship.

The means of BorrowerRate for PastDue, Defaulted and Chargedoff loans are higher than Completed and Current loans.

Not employed people have much higher median BorrowerRate than other groups.

Conditional mean of BorrowerRate decreases as the StatedMonthlyIncome increases. But the variance is big and correlation is weak.

Although BorrowerRate doesn’t seem to have any linear relationship with LoanMonthsSinceOrigination. BorrowerRate does seem to change with time. We can see that in different periods, the BorrowerRate has different ranges and patterns.

Interesting Relationships between the Other Features

The correlation between StatedMonthlyIncome and LoanOriginalAmount is not strong. But the condition mean shows that the mean of LoanOriginalAmount increases as StatedMonthlyIncome increases.

The conditional mean of MonthlyLoanPayment increases as StateMonthlyIncome increases. Sqrt-transformed StateMonthlyIncome seems to have better linear relationship with the condition mean of MonthlyLoanPayment. The Correlation between square root of StateMonthlyIncome and MonthlyLoanPayment is 0.37, showing a moderate positive linear relationship.

ProsperScore and CreditScore have moderate positive relationship with correlation coefficient of 0.37.

The term of loan only has a few discrete values, the MonthlyLoanPayment is mainly determined by LoanOriginalAmount, loan term and borrower rate. MonthlyLoanPayment and LoanOriginalAmount have strong linear relationship.

Employed people have higher income than Self-employed people who have higher income than retired people. Not-employed people mostly don’t have any monthly income.

Current loans have the highest StatedMonthlyIncome. PastDue loans have lower StatedMonthlyIncome. Defaulted and Chargedoff loans has lower median StatedMonthlyIncome than PastDue loans. StatedMonthlyIncome of completed loans is in the middle.

Completed loans have the highest median ProsperScore. PastDue and Chargedoff loans have the lowest median Prosperscore.

Retired people have highest ProsperScore and self-employed people have lowest Prosperscore.

The Strongest Relationship I found

BorrowerRate has strong negative relationship with ProsperScore and moderate negative relationship with CreditScore. I think ProsperScore and Credit Score could be used in a model to predict the BorrwerRate.

Multivariate Plots Section

We look at the other variables against the main BorrowerRate vs ProsperScore relationship. I cut the CreditScore, LoanOriginalAmount and LoanMonthsSinceOrigination to create discete variables creditcut, amountcut and monthcut with only a few levels. I will only explore observations with a ProsperScore.

## 
##   (0,650] (650,700] (700,750] (750,800] (800,900] 
##     11542     28152     32515      8958      3686

## loan_sub$creditcut: (0,650]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00    5.00    4.99    7.00   11.00 
## -------------------------------------------------------- 
## loan_sub$creditcut: (650,700]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   5.358   7.000  11.000 
## -------------------------------------------------------- 
## loan_sub$creditcut: (700,750]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   6.151   8.000  11.000 
## -------------------------------------------------------- 
## loan_sub$creditcut: (750,800]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.000   8.000   7.319   9.000  11.000 
## -------------------------------------------------------- 
## loan_sub$creditcut: (800,900]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00    9.00    8.38   10.00   11.00

If we count for constant ProsperScore value, the pattern shows that loans with higher CreditScore tend to have lower BorrowerRate.This pattern holds across each level of amountcut.In terms of LoanStatus, this pattern is clear for Completed and Current loans, but is not so clear for PastDue, Defaulted and Chargedoff cases. In terms of EmploymentStatus, this pattern holds for Employed and Self-employed levels, but gets unclear for other levels. In terms of Term, the pattern is not clear for 12 months level.

From this plot we can see that the pattern changed a lot between (-1,13] and (18,38], which means the relationship between BorrowerRate and ProsperScore is different in different periods of time.

If we count for constant ProsperScore value, the pattern shows that Current loans tend to have lower BorrowerRate than PastDue loans and Completed loans. Defaulted and Chargedoff loans have similar, highest BorrowerRate.

If we count for constant ProsperScore value, the pattern shows that unemployed people tend to have highest BorrowerRate, Retired people have the second highest BorrowerRate, employed people in third place and self-employed have lowest BorrowerRate.

Loans with smaller original amount tend to have higher BorrowerRate.

The conditional mean of BorrowerRate with given ProsperScore is related with different periods of time. There are roughly 3 periods: 0-13, 13-38, 38-57. We can see the relationship between BorrowerRate and ProsperScore is more clear in loans within the last 13 month.

I will use these variables to build a linear model to predict the BorrowerRate.

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loan_sub2)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loan_sub2)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut, 
##     data = loan_sub2)
## m4: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut + 
##     LoanStatus, data = loan_sub2)
## m5: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut + 
##     LoanStatus + EmploymentStatus, data = loan_sub2)
## m6: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut + 
##     LoanStatus + EmploymentStatus + monthcut, data = loan_sub2)
## m7: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut + 
##     LoanStatus + EmploymentStatus + monthcut + Term, data = loan_sub2)
## 
## =================================================================================================================================================================
##                                                        m1              m2              m3              m4              m5              m6              m7        
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                          0.317***        0.645***        0.604***        0.657***        0.656***        0.627***        0.580***  
##                                                       (0.001)         (0.003)         (0.003)         (0.003)         (0.003)         (0.002)         (0.002)    
##   ProsperScore                                        -0.020***       -0.017***       -0.015***       -0.016***       -0.016***       -0.018***       -0.018***  
##                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   CreditScore                                                         -0.000***       -0.000***       -0.000***       -0.000***       -0.000***       -0.000***  
##                                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   amountcut: (4000,10000]/(0,4000]                                                    -0.030***       -0.024***       -0.024***       -0.018***       -0.023***  
##                                                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   amountcut: (10000,15000]/(0,4000]                                                   -0.043***       -0.031***       -0.031***       -0.021***       -0.031***  
##                                                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   amountcut: (15000,35000]/(0,4000]                                                   -0.049***       -0.032***       -0.033***       -0.016***       -0.026***  
##                                                                                       (0.001)         (0.001)         (0.001)         (0.001)         (0.001)    
##   LoanStatus: Completed/Chargedoff                                                                    -0.026***       -0.025***       -0.021***       -0.017***  
##                                                                                                       (0.001)         (0.001)         (0.001)         (0.001)    
##   LoanStatus: Current/Chargedoff                                                                      -0.057***       -0.057***       -0.020***       -0.021***  
##                                                                                                       (0.001)         (0.001)         (0.001)         (0.001)    
##   LoanStatus: Defaulted/Chargedoff                                                                     0.002           0.002           0.001           0.001     
##                                                                                                       (0.002)         (0.002)         (0.001)         (0.001)    
##   LoanStatus: FinalPaymentInProgress/Chargedoff                                                       -0.048***       -0.048***       -0.021***       -0.018***  
##                                                                                                       (0.003)         (0.003)         (0.003)         (0.003)    
##   LoanStatus: Past Due (>120 days)/Chargedoff                                                         -0.012          -0.011           0.001          -0.003     
##                                                                                                       (0.012)         (0.012)         (0.010)         (0.010)    
##   LoanStatus: Past Due (1-15 days)/Chargedoff                                                         -0.030***       -0.031***       -0.010***       -0.011***  
##                                                                                                       (0.002)         (0.002)         (0.002)         (0.002)    
##   LoanStatus: Past Due (16-30 days)/Chargedoff                                                        -0.031***       -0.031***       -0.011***       -0.013***  
##                                                                                                       (0.003)         (0.003)         (0.003)         (0.002)    
##   LoanStatus: Past Due (31-60 days)/Chargedoff                                                        -0.027***       -0.027***       -0.005*         -0.007**   
##                                                                                                       (0.003)         (0.003)         (0.002)         (0.002)    
##   LoanStatus: Past Due (61-90 days)/Chargedoff                                                        -0.023***       -0.023***       -0.005*         -0.007**   
##                                                                                                       (0.003)         (0.003)         (0.002)         (0.002)    
##   LoanStatus: Past Due (91-120 days)/Chargedoff                                                       -0.021***       -0.021***       -0.005          -0.007**   
##                                                                                                       (0.003)         (0.003)         (0.002)         (0.002)    
##   EmploymentStatus: Full-time/Employed                                                                                -0.004***       -0.025***       -0.024***  
##                                                                                                                       (0.001)         (0.001)         (0.001)    
##   EmploymentStatus: Not employed/Employed                                                                              0.035***        0.021***        0.022***  
##                                                                                                                       (0.002)         (0.002)         (0.002)    
##   EmploymentStatus: Other/Employed                                                                                    -0.001           0.001           0.002**   
##                                                                                                                       (0.001)         (0.001)         (0.001)    
##   EmploymentStatus: Part-time/Employed                                                                                -0.007*         -0.029***       -0.029***  
##                                                                                                                       (0.003)         (0.003)         (0.003)    
##   EmploymentStatus: Retired/Employed                                                                                   0.001          -0.018***       -0.017***  
##                                                                                                                       (0.002)         (0.002)         (0.002)    
##   EmploymentStatus: Self-employed/Employed                                                                            -0.009***       -0.012***       -0.009***  
##                                                                                                                       (0.001)         (0.001)         (0.001)    
##   monthcut: (13,18]/(-1,13]                                                                                                            0.045***        0.044***  
##                                                                                                                                       (0.001)         (0.001)    
##   monthcut: (18,38]/(-1,13]                                                                                                            0.056***        0.058***  
##                                                                                                                                       (0.000)         (0.000)    
##   monthcut: (38,100]/(-1,13]                                                                                                           0.077***        0.079***  
##                                                                                                                                       (0.001)         (0.001)    
##   Term                                                                                                                                                 0.001***  
##                                                                                                                                                       (0.000)    
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                            0.422           0.506           0.559           0.614           0.617           0.694           0.719     
##   adj. R-squared                                       0.422           0.506           0.559           0.614           0.617           0.693           0.719     
##   sigma                                                0.057           0.052           0.050           0.046           0.046           0.041           0.040     
##   F                                                61989.938       43399.745       21501.861        9005.795        6506.329        7998.511        8693.474     
##   p                                                    0.000           0.000           0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood                                  123078.549      129702.021      134535.681      140222.959      140522.009      149984.946      153706.340     
##   Deviance                                           273.094         233.621         208.465         182.312         181.031         144.840         132.676     
##   AIC                                            -246151.098     -259396.041     -269057.362     -280411.918     -280998.018     -299917.892     -307358.680     
##   BIC                                            -246123.052     -259358.647     -268991.921     -280252.990     -280782.998     -299674.826     -307106.266     
##   N                                                84853           84853           84853           84853           84853           84853           84853         
## =================================================================================================================================================================

This linear model can account for 71.9% of the variance in the BorrowerRate.

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loan_latest)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loan_latest)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination, 
##     data = loan_latest)
## m4: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination + 
##     LoanStatus, data = loan_latest)
## m5: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination + 
##     LoanStatus + EmploymentStatus, data = loan_latest)
## m6: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination + 
##     LoanStatus + EmploymentStatus + amountcut, data = loan_latest)
## m7: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination + 
##     LoanStatus + EmploymentStatus + amountcut + Term, data = loan_latest)
## 
## =================================================================================================================================================================
##                                                        m1              m2              m3              m4              m5              m6              m7        
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)                                          0.279***        0.596***        0.616***        0.628***        0.623***        0.612***        0.574***  
##                                                       (0.000)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)         (0.003)    
##   ProsperScore                                        -0.019***       -0.016***       -0.015***       -0.015***       -0.015***       -0.014***       -0.014***  
##                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   CreditScore                                                         -0.000***       -0.001***       -0.001***       -0.001***       -0.000***       -0.000***  
##                                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   LoanMonthsSinceOrigination                                                           0.004***        0.004***        0.003***        0.003***        0.003***  
##                                                                                       (0.000)         (0.000)         (0.000)         (0.000)         (0.000)    
##   LoanStatus: Completed/Chargedoff                                                                    -0.009***       -0.008***       -0.008***       -0.004*    
##                                                                                                       (0.002)         (0.002)         (0.002)         (0.002)    
##   LoanStatus: Current/Chargedoff                                                                      -0.013***       -0.012***       -0.010***       -0.008***  
##                                                                                                       (0.002)         (0.002)         (0.002)         (0.002)    
##   LoanStatus: Defaulted/Chargedoff                                                                     0.015*          0.013           0.008           0.013*    
##                                                                                                       (0.007)         (0.007)         (0.007)         (0.006)    
##   LoanStatus: FinalPaymentInProgress/Chargedoff                                                       -0.011**        -0.011**        -0.012**        -0.005     
##                                                                                                       (0.004)         (0.004)         (0.004)         (0.003)    
##   LoanStatus: Past Due (>120 days)/Chargedoff                                                         -0.005          -0.005           0.000          -0.006     
##                                                                                                       (0.015)         (0.015)         (0.014)         (0.013)    
##   LoanStatus: Past Due (1-15 days)/Chargedoff                                                         -0.006*         -0.005          -0.004          -0.000     
##                                                                                                       (0.003)         (0.003)         (0.003)         (0.003)    
##   LoanStatus: Past Due (16-30 days)/Chargedoff                                                        -0.002          -0.003          -0.003          -0.000     
##                                                                                                       (0.004)         (0.004)         (0.004)         (0.004)    
##   LoanStatus: Past Due (31-60 days)/Chargedoff                                                         0.001           0.001           0.001           0.004     
##                                                                                                       (0.003)         (0.003)         (0.003)         (0.003)    
##   LoanStatus: Past Due (61-90 days)/Chargedoff                                                         0.004           0.006           0.005           0.006     
##                                                                                                       (0.004)         (0.004)         (0.004)         (0.003)    
##   LoanStatus: Past Due (91-120 days)/Chargedoff                                                        0.004           0.005           0.005           0.006     
##                                                                                                       (0.004)         (0.004)         (0.004)         (0.004)    
##   EmploymentStatus: Full-time/Employed                                                                                 0.002          -0.001          -0.001     
##                                                                                                                       (0.002)         (0.002)         (0.002)    
##   EmploymentStatus: Not employed/Employed                                                                              0.084***        0.072***        0.078***  
##                                                                                                                       (0.004)         (0.004)         (0.004)    
##   EmploymentStatus: Other/Employed                                                                                     0.002**        -0.002**        -0.001     
##                                                                                                                       (0.001)         (0.001)         (0.001)    
##   EmploymentStatus: Part-time/Employed                                                                                -0.009          -0.012          -0.019     
##                                                                                                                       (0.019)         (0.018)         (0.017)    
##   EmploymentStatus: Retired/Employed                                                                                   0.005           0.001           0.002     
##                                                                                                                       (0.006)         (0.006)         (0.005)    
##   EmploymentStatus: Self-employed/Employed                                                                            -0.015***       -0.014***       -0.011***  
##                                                                                                                       (0.001)         (0.001)         (0.001)    
##   amountcut: (4000,10000]/(0,4000]                                                                                                    -0.020***       -0.025***  
##                                                                                                                                       (0.000)         (0.000)    
##   amountcut: (10000,15000]/(0,4000]                                                                                                   -0.021***       -0.029***  
##                                                                                                                                       (0.000)         (0.000)    
##   amountcut: (15000,35000]/(0,4000]                                                                                                   -0.016***       -0.024***  
##                                                                                                                                       (0.001)         (0.001)    
##   Term                                                                                                                                                 0.001***  
##                                                                                                                                                       (0.000)    
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                                            0.568           0.664           0.704           0.705           0.711           0.729           0.762     
##   adj. R-squared                                       0.568           0.664           0.704           0.705           0.711           0.729           0.762     
##   sigma                                                0.039           0.035           0.033           0.033           0.032           0.031           0.029     
##   F                                                58351.104       43873.969       35204.370        8164.812        5757.224        5425.511        6167.393     
##   p                                                    0.000           0.000           0.000           0.000           0.000           0.000           0.000     
##   Log-likelihood                                   80530.088       86117.478       88933.515       89015.274       89490.969       90887.609       93745.795     
##   Deviance                                            69.214          53.818          47.408          47.234          46.233          43.415          38.172     
##   AIC                                            -161054.175     -172226.956     -177857.030     -178000.548     -178939.938     -181727.218     -187441.589     
##   BIC                                            -161028.071     -172192.150     -177813.524     -177870.028     -178757.210     -181518.386     -187224.056     
##   N                                                44415           44415           44415           44415           44415           44415           44415         
## =================================================================================================================================================================

For loans in the last 12 months, the linear model with the same variables will account for 76.2% of the variance.

Multivariate Analysis

Relationships Observed in Investigation

Holding ProsperScore constant, we can see that CreditScore, LoanOriginalAmount have negtive relationship with BorrowerRate, and LoanStatus, LoanMonthsSinceOrigination and Term all have some influence on BorrwerRate and the relationship between BorrwerRate and ProsperScore.

Interesting Interactions Between Features

From the plots, we can see that in different periods of time, the ProsperScore has different relationships with BorrwerRate. For loans in the last 12 months, the relationship between BorrowerRate and ProsperScore is more clear and linear. Also, the relationship between BorrowerRate and CreditScore is more clear in the last 12/13 months.

For loans of different amounts, the relationships between BorrowerRate and ProsperScore are different.

Linear Model

I created a linear model to predict the BorrowerRate. This linear model account for 71.9% of the total variance of BorrowerRate. ProsperScore account for 42.2% of the variance. The variables CreditScore, amountcut (LoanMonthsSinceOrigination), LoanStatus, monthcut (LoanMonthsSinceOrigination), Term each improve the R^2 value by a few percent. Surprisingly, EmploymentStatus didn’t improve the model very much.

27.8% of the variance can’t be explained. One reason may be that some relationships between the BorrowerRate and other variables are not linear. Another reason is that the model is different in different periods of time. Another reason may be that there are factors that are not considered here.

According to the plot, the model to predict the BorrowerRate may be quite different in different periods of time. If I build the model just for loans started in the last 12 months. The model accounts for 76.2% of the variance, and ProsperScore alone account for 56.8% of the variance, which is quite different from the model from the model for all loans with ProsperScore. This means that BorrowerRate is more closely related to ProsperScore in the last 12 months.

Final Plots and Summary

Plot One

Description One

The majority of observations have the BorrowRate between 0.05 and 0.35. The interval near 0.32 has an unusual high frequency. Except for 0.32, the distribution peaks around 0.15.

Plot Two

Description Two

The mean of BorrowerRate (red line) decreases aas ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship.

Plot Three

Description Three

If we count for constant ProsperScore value, the pattern shows that loans with higher CreditScore tend to have lower BorrowerRate across each level of LoanOriginalAmount. But the specific pattern between BorrowerRate, ProsperScore and CreditScore is different for different levels of LoanOriginalAmount. Loans with smaller original amount tend to have higher BorrowerRate and bigger variance of BorrowerRate.


Reflection

This data set contains 113,937 loans with 81 variables on each loan. I select 13 variables to explore. I first explored the distribution of each variable, and then I explored the relationships between 2 variables. Finally, I explore the BorrowerRate across many variables and created a linear model.

The BorrowerRate is strongly negatively related with ProsperScore. Holding ProsperScore constant, CreditScore, LoanOriginalAmount have negtive relationship with BorrowerRate, and LoanStatus, LoanMonthsSinceOrigination and Term all have some influence on BorrwerRate and the relationship between BorrwerRate and ProsperScore. Most of the variables I explored have some level of relationship with the BorrowerRate and helped improved the model. But EmploymentStatus doesn’t seem to help much.

I built a linear model to predit the BorrowerRate. I tried to add many variables to linear model to improve the model. But the R^2 value of the linear model is still not very high. I think for future work, maybe the model should be more than a linear model, and different models should be made for different periods of time. We should also explore more variables from the dataset in the future.